I wanted to build on my previous hypothesis that distance could be used to predict the fare of a route by adding the number of passengers who fly the route per day on average. I feel like more popular flights would be cheaper than those with low flight traffic. Additionally I felt that the best 3rd explanatory variable to include in this analysis was the relationship between distance and passengers. The other options based on the provided dataset just didn’t seem to mesh as well with the two that I have included already.
\[ \underbrace{Y_i}_\text{fare} \underbrace{=}_{\sim} \overbrace{\beta_0}^{\stackrel{\text{y-int}}{\text{base fare}}} + \overbrace{\beta_1}^{\stackrel{\text{slope}}{\text{baseline}}} \underbrace{X_{1i}}_\text{distance} + \overbrace{\beta_2}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{2i}}_\text{passen} + \overbrace{\beta_3}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{dist:passen} + \epsilon_i \]
Below is the Multiple regression result using distance, passengers, and the distance/passenger relationship.
lm.mult <-lm(fare ~ dist + passen + dist:passen, data=IO_airfare)
summary(lm.mult) %>%
pander(caption= "HW 3 Simple Multiple regression results")
| Â | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 118.7 | 2.046 | 58 | 0 |
| dist | 0.06534 | 0.001711 | 38.2 | 1.803e-277 |
| passen | -0.0212 | 0.001724 | -12.3 | 3.192e-34 |
| dist:passen | 1.537e-05 | 1.51e-06 | 10.18 | 4.511e-24 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 4595 | 57.62 | 0.4083 | 0.4079 |
Below is the Multiple regression result using distance, passengers, but without the distance/passenger relationship.
lm.mult2 <-lm(fare ~ dist + passen, data=IO_airfare)
summary(lm.mult2) %>%
pander(caption= "HW 3 Simple Multiple regression w/o Interaction")
| Â | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 108.8 | 1.823 | 59.69 | 0 |
| dist | 0.07541 | 0.001411 | 53.45 | 0 |
| passen | -0.007297 | 0.001063 | -6.864 | 7.574e-12 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 4595 | 58.26 | 0.395 | 0.3947 |
Here are the results from HW 2 regression, prediction of fare using just distance.
lm.sim <-lm(fare ~ dist, data=IO_airfare)
summary(lm.sim) %>%
pander(caption= "HW 2 simple regression results")
| Â | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 103.3 | 1.643 | 62.87 | 0 |
| dist | 0.07631 | 0.001412 | 54.05 | 0 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 4595 | 58.55 | 0.3888 | 0.3886 |
Here is the original equation for the regression with the appropriate coefficients now included.
\[ \underbrace{Y_i}_\text{fare} \underbrace{=}_{\sim} \overbrace{118.7}^{\stackrel{\text{y-int}}{\text{base fare}}} + \overbrace{0.06534}^{\stackrel{\text{slope}}{\text{baseline}}} \underbrace{X_{1i}}_\text{distance} + \overbrace{-0.0212}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{2i}}_\text{passen} + \overbrace{1.537e-05}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{dist:passen} + \epsilon_i \]
#b <- coef(lm.mult)
## Hint: library(car) has a scatterplot 3d function which is simple to use
# but the code should only be run in your console, not knit.
#library(car)
#scatter3d(fare ~ dist + passen, data=IO_airfare)
## To embed the 3d-scatterplot inside of your html document is harder.
#Perform the multiple regression
#Graph Resolution (more important for more complex shapes)
graph_reso <- 0.5
#Setup Axis
axis_x <- seq(min(IO_airfare$dist), max(IO_airfare$dist), by = graph_reso)
axis_y <- seq(min(IO_airfare$passen), max(IO_airfare$passen), by = graph_reso)
#Sample points
lmnew <- expand.grid(dist = axis_x, passen = axis_y, KEEP.OUT.ATTRS=F)
lmnew$Z <- predict.lm(lm.mult, newdata = lmnew)
lmnew <- acast(lmnew, passen ~ dist, value.var = "Z") #y ~ x
#Create scatterplot
plot_ly(IO_airfare,
x = ~dist,
y = ~passen,
z = ~fare,
text = rownames(IO_airfare),
type = "scatter3d",
mode = "markers", color = ~fare)
#add_trace(z = lmnew,
# x = axis_x,
# y = axis_y,
# type = "surface")
Based on the multiple regression, the base cost of a ticket would be $118.70, for each additional mile the fare would increase by $0.065 and for each additional passenger the fare would decrease by $0.021. The strength or the relationship between Distance and passengers is ~0. The P-values for each of these terms are all incredibly close to 0. Although it is worth noting that the probability of the distance variable (B1)is significantly lower than that of the passengers (B2) or relationship (B3), it is much more powerful in estimating fare than passenger count or the relationship.
These relationships are visible best when viewing the 3d plot. It is quickly apparent that distance is a signicant estimator due to the clustering of points along its distance plane.
Assuming that our sample is random the following Q-Q plots aid in examaining the residuals of our points. The first primarily helps to show if variance remains constant across our variables. The second shows some minor signs of right skewness, I believe this is mainly due to the fact that my regression fails to predict base costs of a flight leaving these values up to B0. The third and final plot helps determine if the order of the data is important, usually this is needed for time sorted data but I noticed this set is sorted alphabetically by origin point so I included this to see if any patterns presented themselves.
From these plots my only concerns are that more data closer to the extremes of the dataset would be helpful, additionally some metric that could be used to help account for variance in base costs (such as crew count or time of year) would be great.
par(mfrow=c(1,3))
plot(lm.mult,which=1:2)
plot(lm.mult$residuals)
This could lead to Bias in OLS estimations. It shows that our model experiences non-constant variance over our independent variables.
This could also lead to Bias in OLS estimations. Lingering relationships could cause our model to rely too heavily on provided variables to make an estimate. Omission of relevant variables can also result in hidden patterns within a model.
This could be a potential source of but may requite a bit more information. A correlation of 0.395 between two variables is only weak-moderate at best and likely would not hold major impact on our output. Although a more detailed analysis of this relationship and its potential causes and relative impacts on the model output would be worth the effort. This could be a sign of co-linearity but it may just be correlation without causation.
How can it be that the R2 is smaller when the variable age is added to the equation?
There are a few possible explanations for this occurrence, a few of which could be overcrowding of variables, or occurrences such as heteroskedastiscity. Given that current variables being used I believe that the likely cause of this result is the probable relationship between rank and age within the model that has not been accounted for properly with usage of an interaction term. Another possible reason is that the predictive effect of this new variable is less than that of chance alone, this will cause its addition to ironically reduce the r^2 value.